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Abstract 

This paper addresses the problem of inferring sparse causal networks modeled by multivariate auto-regressive (MAR) processes. 
Conditions are derived under which the Group Lasso (gLasso) procedure consistently estimates sparse network structure. The key 
condition involves a "false connection score" ^jj. In particular, we show that consistent recovery is possible even when the number 
of observations of the network is far less than the number of parameters describing the network, provided that tp < 1. The false 
connection score is also demonstrated to be a useful metric of recovery in non-asymptotic regimes. The conditions suggest a 
' modified gLasso procedure which tends to improve the false connection score and reduce the chances of reversing the direction of 

causal influence. Computational experiments and a real network based electrocorticogram (ECoG) simulation study demonstrate 
pH . the effectiveness of the approach. 



- 

X 



L Introduction 



The problem of inferring networks of causal relationships arises in biology, sociology, cognitive science and engineering. 
Specifically, suppose that we are able to observe the dynamical behaviors of N individual components of a system and that 
some, but not necessarily all, of the components may be causally influencing each other We will refer to such a system as a 
causal network. To emphasize the network-centric viewpoint, we will use the terms node and network, instead of component 
and system, respectively. Causal network inference is the process of identifying the significant causal influences by observing 
the time-series at the nodes. For example, in electrocorticography (ECoG) the electrical signals in the brain are recorded directly 
^ , ' and a goal is to identify the direction of information flow from one brain region to another 

One common tool for modeling causal influences is the multivariate autoregressive (MAR) model |[T]-||3]. MAR models 
assume that the current measurement at a given node is a linear combination of the previous p measurements at all TV nodes, 
K*" plus an innovation noise: 

!>■ ; x(i) = 2J A,.x(t - r) + u(t) (1) 



where x(t) = [a;i(i) X2{t) ... XN{t)\ is a vector of signal measurements across all N nodes at time t, matrices 
A,. = {o-ijii^y} contain autoregressive coefficients describing the influence of node j on node i at a delay of r time samples, 
and u(t) ~ U2{t) ... WAr(<)] ^ A/'(0,S) is innovation noise. The MAR model is especially conducive to the 

assessment of Granger Causality, where time series Xj is said to Granger-cause Xi if knowledge of the past of Xj improves 
the prediction of xi compared to using only the past of |j4l. 

The MAR model in Eq. ([T]) allows for the possibility of a fully connected network in which every node causally influences 



, every other node. This flexibility is somewhat unrealistic and leads to practical challenges. In many networks each node is 
directly influenced by only a small subset of other nodes. The MAR model is overparameterized in such cases. This leads to 
serious practical problems. It may be impossible to reliably infer the network from noisy, finite-length time-series because of the 
large number of unknown coefficients in overparameterized models. We define the Sparse MAR Time-series (SMART) model 
to have the same form as Eq. ([T]i but include an extra parameter iSaci™ denoting the index pairs of non-zero causal influences 
to eliminate overparameterization. For example, if node j influences node i, then (i, j) G otherwise ^ 5^c,i,c and 

J (r) — for all time indices r. The SMART model for node i is given by; 

p 

Xi{t) = Ui{t) + ^ ^ a,.j (r)xj(t - r) (2) 

j:(lj)e5,a™ r=l 

Applying Eq. (|2]i to each node i = l,2,...,A^in turn gives the SMART model for the whole network. 

If the cardinality of the active set, denoted |5„c,ivc|, is equal to N'^, then the SMART model is equivalent to the MAR model. 
We are primarily interested in networks for which |iSin,ivc| < niN, for some constant m > 1. In such cases, the main inference 
challenge is reliably identifying the set 5.,„ivc, since once this is done the task of estimating the SMART coefficients is a simple 



This work supported in part by the NIBIB under NIH awards EB005473, EB009749 and by tlie AFOSR award FA9550-09-1-0I40. 

Tliis work is sponsored by the department of the Air Force under contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations 
are those of the author and are not necessarily endorsed by the United States Government. 



2 



and classical problem. In general, the amount of data required to reliably estimate SMART coefficients decreases as 
decreases. 

Identifying S^^wc is a subset selection problem. Simple subset selection problems can be solved using the well-known Lasso 
procedure. The Lasso mixes an £2 norm on the residual error with an £1 norm penalty on the regression coefficients favoring 
a solution in which most coefficients are zero ||5]. However, ordinary Lasso does not capture the group structure of sparse 
connections in the SMART model. The Group Lasso (gLasso) procedure was first proposed by 16] in a general setting to 
promote group-structured sparsity patterns. gLasso penalties have recently been proposed for source localization in magneto- 
/electroencephalography (M/EEG) ll7l- lfT2l . as well as for identifying interaction patterns in the human brain [fT3l and in gene 
regulatory networks [14!. In both |[T3l and l[T4l the gLasso is effectively applied to SMART model estimation by penalizing 
the sum of £2 norms of the coefficients of each network link {£1 norm of £2 norms). We study estimation consistency of this 
technique which we term the SMART gLasso or SG. 

Our main contribution is a novel characterization of the special conditions needed for consistency of the SG. These conditions 
are described in Section |lll] Existing gLasso consistency results do not apply to the temporal structure in the SMART model. 
The SG consistency conditions are similar in spirit to the standard "incoherence" conditions encountered in the analysis of 
Lasso and its variants 1 15] , but are fundamentally different because of the autoregressive structure of our model. We define 
the "false connection score" and show that it yields a condition for consistent estimation of the underlying SMART sparsity. 
If this score is below one, then the network connectivity pattern can be recovered with high probability in the limit as the 
size of the network and the number of samples tends to infinity (although the number of samples can grow much slower than 
the network size). Conversely, if this score is above one, than an estimate that identifies all the correct connections will also 
include at least one false positive with high probability. 

We also propose a variant of the SG in Section which does not penalize self-connections (i.e., each node is free to 
influence itself). We call this variant Self-Connected SMART gLasso (SCSG) and show that it typically results in a lower false 
connection score for SMART models. We provide some example networks as well as their false connection scores for the 
SMART gLasso and SCSG approaches in Sec. [V] We demonstrate the effectiveness of our results by simulating a variety of 
networks in Sec. IVII We also apply our results to a realistic brain network in Sec. IVIII bv simulating the sparse connectivity 
pattern observed in the macaque brain. 

II. Graph Inference with Lasso-Type Procedures 
In this section we introduce the Lasso, gLasso, SG, and SCSG, and discuss previous consistency results. 



A. Lasso and gLasso 

Tibshirani first proposed the Least Absolute Shrinkage and Selection Operator (Lasso) in 1996 to "retain the good features of 
both subset selection and ridge regression" ||5]- Although originally stated as an £1 norm constrained least squares optimization, 
the Lasso can also be stated as an unconstrained mixed-norm minimization. We consider the unconstrained problem throughout: 

gLasso ^ argmin -lly - Xall^ + Allalli (3) 
ot n 

Here it is assumed that measured length n vector y is the result of a sparse linear combination of columns of X; i.e. y = Xa for 
sparse vector a. The first term of (|3) penalizes solutions which do not fit the measured data well, while the second term favors 
solution which are sparse. Yuan and Lin ||6] introduced the Group Lasso (gLasso) extension to Tibshirani's Lasso in 2006. 
While the Lasso penalizes the ii norm of the coefficient vector, the gLasso divides the coefficient vector into predetermined 
sub-vectors and penalizes the sum of the £2 norms of the sub-vectors; i.e., the £1 norm of £2 norms: 

2 

N 

+ A5Il|a42 (4) 

1=1 

2 

Such a penalty is beneficial when each group of coefficients is beUeved to be either all zero or all non-zero, and the solution 
contains only a small number of nonzero coefficient groups, e.g., ll7l- lfT2l . 

Solving the SMART model subset selection problem with the gLasso leads to the SG estimate: 

1 2 ^ 

af^ = argmin - j|y,; - Xaij|2 + A !|a,;j II2 (5) 



a- 



gLasso 



arg mm — 
an 



where we define: 



3 



yi = 



X, = 



a. 





x,{t 


Xi{t 


-1) 


Xi[t 


- ^1 


Xi{t - 


- n) 




X2 . 


ai,j{l 


) «i, 


ai,i 


ai,2 



Xi{t n 

Xi{t-p) 

Xi{t-p~ 1) 



:(2) 



ai.Af] 



1). 



^SCSG ^ argmin -||y, - XaJ^ 

ai 77, 



The SCSG removes the penalty for self-connections, that is, each node's own past values are allowed to predict its current 
value without a penalty: 

A^||a,,,||2 (6) 

This represents the expectation of sparse connectivity between nodes. 

The gLasso optimization falls into a class of well-studied convex optimization problems. Many algorithms have been proposed 
for solving this sort of problem (see fll61 for a description and comparison of several approaches). Greedy procedures, such as 
group orthogonal matching pursuit, have been proposed as well I17j|. The choice of optimization algorithm is not an important 
concern in this paper; rather the main contribution of this paper is to characterize the behavior and consistency of the solution 
of Eqs. ^ and 



B. Graphical Model Identification 

Lasso-like algorithms have found application in high dimensional graphical model identification. The seminal work in this 
area was done by Meinshausen and Biihlmann ifTSl who consider estimating the structure of sparse Gaussian graphical models 
by identifying the nonzero entries of the inverse covariance matrix. They consider an undirected graph where each vertex 
represents a variable and edges represent conditional dependence between two variables given all other variables. Conditionally 
independent variables do not share an edge and correspond to a zero entry in the inverse covariance matrix. Identifying the edge 
set, or nonzero entries in the inverse covariance matrix, is achieved by writing independent samples of one variable as a sparse, 
but unknown linear combination of the corresponding samples of the other variables, then using the Lasso. Meinshausen and 
Biihlmann flSl show that this procedure consistently identifies the edge set even when the number of variables (vertices) grows 
faster than the number of samples. Ravikumar, et al., |fT9l propose an alternative Lasso like approach to the same problem 
by maximizing the £i norm penalized log-likelihood function. In this case the first term of Eq. (O is replaced with an inner 
product and log-determinant of the covariance matrix. The graphical lasso technique solves this type of problem efficiently for 
very large problems ll20l . 




(a) SMART Model Temporal Depiction (b) SMART Model Network 

Depiction 



Fig. 1. Two graphical depictions of a two node, second order SMART model, (a) Explicit time dependence structure, (b) Shorthand depiction of (a) 
suppressing time and self-connections. 



The SMART model is a graphical model involving causal relationships and consequently, an element of time. The resulting 
model is a directed graph, and each node can be represented by multiple vertices: one for the current value, and potentially 
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infinitely many for past values at that node as shown in Fig. |l(a)| To ease visualization, we suppress time dependence and 
illustrate causal influence with a single arrow linking one vertex per node as shown in Fig. |l(b)| Here we have not shown 
self-connections. Nodes which have a causal influence are termed "parent nodes" (node 2 in Fig.!!) and the nodes they influence 
"child nodes" (node 1 in Fig.[T]i. Given that graphs representing MAR models are directed, the existing analyses by Ravikumar, 
et al., ||T9| and Meinshausen and Biihlmann ifTSll are insufficient. The additional notions of causality and a temporal element 
place the SMART model in the realm of graphical Granger models lfT4l . ET\ . 



C. Existing Lasso and gLasso Consistency Results 

There are many existing results on consistency of the Lasso (e.g., ifTSl . Il22l ) and extensions of these to the gLasso or closely 
related problems (e.g., IfTTI . ||23l - ||30l ). An important concept in all these results is mutual incoherence, the maximum absolute 
inner product between two columns of X. Mutual incoherence is extended to grouped variables by using the maximum singular 
value of Xf Xj in place of the vector inner product. Analyzing mutual coherence in the SMART model setting is challenging 
due to the strong statistical dependence between columns of X. Both Lasso and gLasso have recently been successfully applied 
to SMART networks (e.g. ifTSl . lfT4l . ||3T1 . Il32l ). but consistency was not considered. In independent work, the consistency 
of first-order AR models (a special case of the general problem considered here) is investigated in f33\. We identify novel 
incoherence conditions tailored specifically to the SMART model, and show how the network structure of the model affects 
these conditions. Thus these incoherence conditions provide unique insight into the capabilities and limitations of SG model 
identification. 



III. Asymptotic Consistency of SMART gLasso 

In this section we provide sufficient conditions for the asymptotic consistency of the SG estimate assuming the data are 
generated by a SMART model. Our general approach is similar to the style of argument used in the analysis of gLasso 
consistency |[30| and other graph inference methods based on sparse regression ifTSl . An important distinction in SG is the 
MAR structure of the design matrix X. 

Let Si — { j £ {1, ... , N} : {i, j) S iSaa™}, i = 1, . . . , N indicate the subset of nodes that causally influence node i. Define 
X^; and X^c to be submatrices of X composed of the matrices Xj, j € Si and Xj, j ^ Si, respectively. An oracle that 
knows Si does not need to solve the subset selection problem but only a regression problem with design matrix X^^ and 
parameters a.ij, j £ Si. 

Our main result makes use of a regression problem with the same design matrix. Consider a node j with j <^ Si. The 
optimal linear predictor of Xj given X^^ is J^kes- ^k^j,k where the ^j,k minimize IE[||Xj — X^feeS ■^fc'^i.fcllF]- 
stack {'S'j fclfeg^. to form a matrix "i^j.Si, then we can write X]fee5 ^k'^j,k = X^.'S'j^;. Using standard matrix calculus it 
is not difficult to verify that 



where the covariance matrix 

R5„5, -1E[X^,X5,]. 

Recall the following variables: N, the number of nodes in the network; m, the maximum number of parent nodes; p, the 
SMART model order; and n, the number of observations. The main result concerning the consistency of SMART gLasso is 

Theorem 1: Let Cpower, Ccon, Cmin, Cmax, and C/cs be non-negative constants. Assume entries in and the corresponding 
row of each Xj matrix come from independent realizations of the SMART model. Assume the following conditions hold: 

1) Scaling: N, m, and p aie 0{n''), while A is 6(71"^) for different c > with toA^ ~ o(l) and = o(l). 

2) Signal Power: 

max erf = E[x^(i)] < Cpower < 00 

i&{l,...,N} 

3) Connection Strength: min(j^j)g5^^j^^,^ llaij'lh > Ccon > 

4) Minimum Power: max.; ||R^ <;,||2 < C^] < 00 



5) Maximum Cross Correlation: 



max||Rc. 5CII2 < Cmax < 00 



where 

6) False Connection Score: For all € 5£„ 



FC 



keSi 



\\^,kh 



< Cfcs < 1 (7) 
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Then for all n sufficiently large, the set of links identified by SG satisfies S ~ 5,„i„ with probability greater than 1— cxp(— 0(n)); 
i.e., zero and nonzero links identified by SG agree with those of the underlying true model. 

Proof: The proof is presented in Appendix |A] ■ 

Note we have used the following notation: f{n) ~ 0{g{n)) implies < k\g{n)\ for some fc > and large n, 

f{n) = Q{g{n)) implies fci|5(n)| < |/(n)| < k2\g{n)\ for some positive constants fci and k2 and large n, and f{n) ~ o{g{n)) 
implies |/(^^)| < k\g{n)\ for all fc > and large n. 

Assumption [T] specifies how network parameters grow as a function of the number of observations n. It may be possible to 
allow some or all of the constants Cpower, Ccon, Cmin, Cmax, and C/cs to depend on n, but for the purposes of this paper we 
will take these to be constants. The number of nodes in the network N can grow at any polynomial rate, including both faster 
or slower than the number of observations n, or remain fixed. Assumptions |2]-|5] are rather mild. They are used to show that 
there will be no false negatives for sufficiently small A. In practice, signals are often normalized to have equal power across 
nodes, which automatically achieves |2l though only this weaker assumption is necessary here. The effect of normalization on 
the other assumptions, particularly |6] is an interesting open question. Assumption |4] essentially says that each time sample in 
the active set contains some independent information. Assumption |5] ensures that any influence due to the nodes in Si cannot 
be easily generated using nodes in Sf instead. 

Assumption |6] is the most restrictive and most informative. In the proof of the theorem. Assumption |6] is used to show that 
the probability of declaring a nonzero connection when none exists (i.e. a false connection or false alarm) goes to zero for large 
n. In order to understand the implications of the assumption, we point out a more restrictive, but less complicated alternative: 
X]fee5 ll^j.fclb < Cfcs < 1- If this inequality holds. Assumption |6] follows from simple norm bounds. The inequality also 
suggests the following interpretation of Assumption |6] Nodes that do not directly drive the node of interest (i.e., nodes in 5p) 
cannot be easily predicted from nodes that are directly driving the node of interest. In Section[V]we provide example networks 
that do and do not satisfy Assumption |6] to gain insight into the nature of which networks can be recovered. We show next 
that Assumption |6] is necessary for a large class of networks, including those of fixed size. 

Theorem 2: Suppose Assumptions |2H5] of Theorem [T] hold, but i'f^i > 1 + c for some pair and constant c > 0. 

Suppose also that m?p < n for large n. Then with probability exceeding 1 — cxp {—<d{n)), the connections recovered by SG 
will not be the true connections. 

Proof: A proof is given in Appendix |B] ■ 

Theorem |2] suggests that the false connection score is extremely important in sparse network recovery, especially in finite 
parameter networks, which are discussed below in Sec. IIV-AI 

The SCSG ^ assumes that each node is driven by its own past. The conditions of Theorem [T] with minor modification, 
still govern the ability to recover the correct connectivity pattern using SCSG: 

Corollary 1: Suppose Assumptions [T}{5] of Theorem [T] hold for all /. In place of Assumption |6l assume: 



FC 



< Cfcs < 1. (8) 



2 

Then with probability exceeding 1 — cxp (—8(77,)), the connections recovered by SCSG will be the true connections. 

Proof: See Appendix ICl ■ 
As we will show in the next section, is typically lower than i^f-^i, though cancellation between the self-connection 

term and other terms in the sum of d?) is possible. 



IV. Network Recovery 

In Section |lll] we established conditions which guarantee high probability recovery of SMART networks asymptotically, 
allowing the network size to grow faster than the number of samples. Next we explore the differences between the asymptotic 
setting and finite sample regimes. 



A. Recovery of Finite Parameter Networks 

In practice, the network parameters are typically fixed, and we are interested in performance as the number of measurements 
n grows. The results of Theorems [T] and |2] still apply. In the finite network case, m, p, and N are fixed, so {'m?p)/n tends to 
zero and Assumption [T] is satisfied as long as = 0{n^^) with < c < 1. Also, Assumptions |2]-|5] are automatically satisfied 
as long as there is driving noise in each node. Assumption |6] is the only one that does not necessarily hold. This implies the 
following corollary, which follows immediately from the proof of Theorem [T| 

Corollary 2: For a SMART model with fixed parameters, (|5]i will recover the correct network structure with probability 
greater than 1 — exp (— 0(n)) if i^f^i < 1 for all pairs G If i^f^i > 1 for some € 52^^, then ^ will fail to 
recover the correct structure with probability exceeding 1 — exp (— 0(n)). The same result holds for (|6]l using i'f^i- 
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B. Recovery of Known Networks 

Given Corollary |2] it is easy to check whether a given SMART model structure can be recovered via (|5) or (2). Define 
r(r) — E[x{t)x.^{t — r)], and recall S is the driving noise u(t) covariance matrix. If we define the collection of MAR 
coefficients A and S as; 

[Ai A2 ... A„ 



S = 



then T{t) can be calculated via (see e.g. ||4l) 



'■Af(p-l) 

S 








{p-l)N 



1)N 'J(p-l)JV. 



AFA^ 



(9) 



where 



r(o) 
r(-i) 



r(i) 
r(o) 



T{p-l) 
T{p-2) 



T{l-p) T(2-p) ... r(o) 
Using properties of Kronecker products, (|9) can be solved in closed form: 



vec (r) = (I - A (g) Ay'vcc S 



(10) 



Given this closed form expression for F, matrices Rsi,^; and R^. are formed for each node i by selecting the appropriate 
entries from covariance matrix F and subsequently used to calculate ^j^Si - Given ^j^Si and a^ ^ for all k £ Si, ipf^i or tpj'^i 
can be calculated and compared to one via Eq. Q or (|8), respectively. 



C. Challenges in Realistic Networks 

The theoretical basis for SMART model recovery relies on independent data samples and asymptotic probability concentration 
arguments. We now consider consequences of more realistic data sets. 

Our analysis focuses on the dependence accross columns of X and the corresponding entry of induced by the SMART 
model. To prove Theorems [T] and |2] we assumed each row of X and the corresponding entry of y; to be independent from 
other rows. This is not true in realistic networks where each X^ is actually Toeplitz; however, rows of X and y^ decorrelate 
as the time lag between them grows (E[x(t)x^(i — t)] « 0). The simulations in Sees. |VI| and [VTIl use correlated rows and 
reveal that the false alarm score has a more significant impact on performance than the row dependence. The effect of row 
dependence has been consider in the special case of first order (p = 1) AR models in [33], which yields a lower bound on the 
required number of observations. 

An additional challenge - and motivation for group sparse approaches - is the limited number of data samples available. 
Specific connectivity patterns in a SMART model of a real network may change over time, which limits the number of samples 
for which the network is approximately stationary. Analysis of the performance of (|5) or (|6) is difficult for limited data cases 
(finite n); however, the asymptotic theory and the simulations presented in Section |VT] suggest that when ipf^i is small, 
connectivity estimation is easier Also, weak connections (for which j|ai.j||2 is small) are more difficult to recover with limited 
data. For small enough A and large enough n, all connections will probably be recovered. When n is limited, the probability 
of recovering all connections, particularly weak ones, is decreased. 

Although Theorem [T] indicates how A should scale with n, selecting A for non-asymptotic regimes can be difficult. As seen 
in Section IVII A balances missed connections (Type II errors) with false positives (Type I errors). Ideally, one would select 
A to achieve a specified famlywise error rate or false discovery rate; however, calculating p-values of each connection for a 
given A is an open problem. 

Due to the difficulty of selecting an appropriate regularization parameter, it can be beneficial to consider the family of 
solutions achieved by varying A. The expectation-maximization (EM) algorithm described in ifTSl efficiently solves the SG or, 
with slight modification, SCSG problem over a range of A, successively adding connections as A decreases. In that work, a 
heuristic is used to select a single A from the family of possible solutions fT2l . In Sec. |VT]we use tenfold cross-validation 
to select the A which performs best on held out data. Another possibility is to apply a Wald test for Granger-causality ||4] 
successively to the last connection which enters the model and stop when a connection passes the test. A recently proposed 
stability selection technique combines lasso and randomized subsampling to provide subset selection with false discovery rate 
bounds 1341 . This technique could potentially be applied to the SMART model at the expense of additional computation. 
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D. Normalization 

Measurements from each node are often normalized to have equal power ifTSl . ||35l . We can account for normalization in any 
SMART model as follows. Equal power in all channels means the diagonal of T consists of all ones. Thus we can transform 
r to a normalized model using a diagonal matrix D^^/^ to obtain T = D^aTD^^. Eq. (|9]l implies: 



D-i AD* ( D SrD * ) D3 A^D 5 



+D 5SD 

Af A^ + E* 



where: 



D 



Di 
D2 










D 



N 



Here = aflp where of is the power in each node before normalization. 

The effect of normalization on the ability of group sparse approaches to recover network structures is complicated. We have 
found that normalization tends to decrease ijjmax = max^j j)(zs^'. ''Pj^i^ indicating an improvement in asymptotic recoverability 
(for fixed m, p, and N at least). On the other hand, normalization clearly alters connection strength, meaning some connections 
may be weakened due to normalization and difficult to recover in the finite sample case. 



V. Example MAR networks 

The false connection scores ipfJ^i and i/jf^i are the key quantities that determine whether SG or SCSG will recover the 
connections which influence node i. We consider four example networks in this section to develop insight on the nature of 
identifiable topologies. Figured depicts circular and parallel topologies constructed for this paper while Fig. |3]depicts networks 
that have been studied in previous literature (see llTjlFl) . We compute the false connection scores for both the original 

to determine whether the network is identifiable as \ 



network and after normalization (Sec. IIV-D 
The maximum false connection scores for each network are listed in Table HI 



00 for SG and SCSG. 
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(a) Circle Network 

Fig. 2. Contrasting example MAR topologies, self-connections not shown. 



(b) Parallel Network 



Each node in the "Circle Network" shown in Fig. |2(a)| is driven by it's own past as well as one other node forming the 
topology of a large feedback loop. We chose MAR order p = 4 and drew MAR coefficients from a normal distribution 
{J\f{0, 0.041)). The first realization which resulted in a stable network is selected. The maximum false connection scores for 
this network are ipf-^i ~ 0.47 and i/jf-^i = 0.43. Since these are less than one, the network connectivity can be recovered (as 
n -> 00) using both SG and SCSG. 

The parallel network (Fig. |2(b)| i connectivity structure and coefficients were selected deliberately to confound group sparse 
approaches. We chose 3.2^2 = [.2 .2 .2 .2]"^ and ai^i = [.05 .05 .05 .05]-'" for i ^ 2. All other connections shown are given 
by SLi^j — [15 .15 .15 .15]^. This network highUghts several important aspects of SCSG, so we explore it in some detail. The 
false connection scores for this network are summarized in Table [III 

'in 1131 the direction of causal influence is unclear. The network structure is described by a matrix of ones and zeros, but it is unclear whether a one in 
the (j, j)*'' position represents a connection from i to j or vice versa. We show one possibility here and note that the other possible network (not shown) 
has similar properties. 
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(a) Winterhalder Network (b) Haufe Network 



Fig. 3. MAR network topologies from existing literature. 

TABLE I 

Maximum false connection scores. 



Network 


Original 


Normalized 


T max 


'rmax 


'rmax 


'rmax 


Circle 


0.47 


0.43 


0.47 


0.43 


Parallel 


1.93 


1.06 


1.04 


1.03 


Winterhalder 


0.46 


0.29 


0.24 


0.15 


Haufe 


0.83 


0.56 


0.71 


0.57 



No matter which approach is used, a false connection from node 2 to node 1 will be established with high probability as 
71 ^ oo. This is due to the fact that there are four parallel paths connecting node 2 to node 1. Since node 2 has such a strong 
combined influence on node 1, group sparse approaches are likely to identify a direct link. False connections from node 1 
to nodes 3-6 are also likely for large n when SG is used. On the other hand, the probability of linking 1 to 3-6 goes to 
zero as n increases if SCSG is used. This illustrates an important characteristic of SCSG: the asymptotic likelihood of false 
connections from a child to a parent tends to be reduced when self-connections are not penalized. Proving this is always true 
seems difficult, but we provide some rationale. The difference between ikf^i and "i^f^i is the term ' norm 

lies between the singular values of the square matrix '^j.i- While it is difficult to verify that vector a^^i lines up with a strong 
left singular vector of '4'j,i, we can expect that ^ j_i will be "large" relative to other j^k since there is a connection from i 
to j. 

The false connection score from node 1 to node 2 in Fig. |2(b)| highlights another important (and related) feature of SCSG. 
The probability of falsely identifying connections to any node t which is only influenced by its own past goes to zero as n 
goes to oo since V'jlYi is always zero. 

The parallel network example also indicates that additional, unconnected nodes (i.e., node 7) do not change the false 

TABLE 11 

False connection scores for Parallel Network. 



Connection 


Original 


Normalized 










1 -> 2 


1.41 





0.74 





2 -!> 1 


1.06 


1.06 


1.04 


1.03 


1 -> 3 
1 -> 4 
1 -5> 5 
1 -> 6 


1.93 


0.71 


1.00 


0.37 


3 -> 2 

4 -> 2 

5 -> 2 

6 ^> 2 


0.61 





0.63 
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connection scores of connected nodes. The chance of a false connection will increase in the finite n case, but asymptotically 
such additional nodes do not matter since, as n grows, the estimated correlation between two unconnected nodes will go to 
zero. 

The network in Fig. |3(a)| (see 13]) is not only group sparse, but sparse as well; every connection but one (self-connection 
of node 4) consists of only one coefficient at one time lag, as shown by: 



Xi{t) 


= 0.8xi{t- 


1) 


X2{t) 


= 0.6x2{t- 


1) 


X3{t) 


= 0.5a;3(t- 


3) 




+U3{t) 




X4{t) 


= l.2xi{t- 


1) 



0.65x2{t-4)+ui{t) 

0.6.T4(t-5)+M2(t) 

0.6.Ti(t- 1) +0.4x2(t-4) 
0.7x4,{t - 2) + U4,{t) 



As shown in Table H this network is recoverable by either method. 

The structure of the network shown in Fig. |3(b)| is taken from Fig. 1 of |131. As in [131, we draw coefficients from a 
Af{0, 0.041) distribution and check for stability. This network, which includes multiple paths of influence and feedback loops, 
can be recovered via both SG and SCSG with high probability as n increases. 

VI. Simulations 

We now simulate the circle and parallel networks depicted in Fig.|2]to illustrate SG and SCSG network recovery performance 
with finite n. (Simulations of the Haufe and Winterhalder networks performed similarly to the circle network and are omitted for 
space.) Signals were simulated via ([TJ with the initial condition for each simulation determined from the steady state distribution 
and with white driving noise of equal power in each node. The expectation-maximization (EM) algorithm described in fl2\ is 
used to solve the SG and SCSG optimization problems for A e [0.05Xmax, ^max], where Xmax is the minimum A such that 

= 0. A specific A is selected separately for each node via tenfold cross vaUdation using prediction error on held out data. 
We assume the correct model order p is known. Thirty realizations of each network are generated with n = 150 time samples. 
We count the percentage of the 30 trials in which the true connections are correctly identified as well as the percentage of 
trials in which nonexistent connections are incorrectly identified. 

The results for SG and SCSG applied to the circle network are illustrated graphically in Fig. |4] The true connections are 
identified in most of the cases for the circle network. The strength of the four connections are given by ||a2,i||2 = 0.46, 
||a-3,2||2 = 0.30, ||a4^3||2 = 0.37, and ||ai^4||2 = 0.28. The two true connections that are most often missed are the weakest 
connections of the four (2 — )• 3 and 4 1). The SCSG approach identifies the connection from 2^3 considerably more 
often, however. The most common false connection with SG was from node 1 to node 4 and occurred in only 2 of 30 trials, 
while a false connection from node 4 to node 3 was identified in 4 of 30 trials using SCSG. Qualitatively similar results are 
obtained for n = 50 and n = 100 with the performance improving for most connections as the number of samples increases. 
A noticeable improvement in ability to identify true connections results as the number of samples increases from n = 50 to 
n = 150. 




(a) Circle Network (b) Circle Network 



Fig. 4. Infening the circle network using SG and SCSG with cross validation from n = 150 time samples. Black Unes and numbers illustrate true connections 
and the percentage of 30 trials in which they are correctly identified. Red dotted hues and text identify the most common false connection and percentage of 
occurrence over 30 trials. 
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(a) Parallel Network (b) Parallel Network 



Fig. 5. Inferring the parallel network using SC and SCSG with cross validation from n = 150 time samples. Black lines and numbers illustrate true 
connections and the the percentage of 30 trials in which they are correctly identified. Red dotted lines and text identify the most common false connection 
and percentage of occurrence over 30 trials. 

As predicted by the theoretical arguments of Sec. IIV-AI the SG approach does not perform as well on the parallel network 
(Fig. |5]). In particular, the true connections from nodes 3, 4, 5, and 6 to node 1 are never identified, the true connections from 
node 2 to nodes 3, 4, 5, and 6 are identified about half of the time, and the connection from node 1 to 6 is incorrectly identified 
in all cases. The next most common false connections (not shown in Fig. |5]) are from node 1 to nodes 3-5 with probabilities of 
93%, 83%, and 87%, respectively. These four false connections (from node 1 to its parents) have the highest false connection 
score {tP(C^ = 1.93, j = 3,4,5,6) for this scenario, according to Table HI] The false connection from node 1 to node 2 is 
the next most common, occurring in 80% of the trials. The false connection score for this link is 1.41. Notice these five most 
common false connections reverse the true direction of causal influence. 

The SCSG approach performs considerably better for the parallel network, consistent with the improvement in the false 
connection scores given in Table HI] The connections from node 2 to nodes 3-6 are almost always discovered, although the 
true connections from nodes 3-6 to node 1 are missed more frequently. However, SCSG identifies a connection directly from 
node 2 to node 1 in 70% of the trials. A possible explanation for this error is that a single connection from node 2 to node 1 
is a sparser solution than connecting nodes 3-6 to node 1 and accounts for much of the variance at node 1. The connection 
from node 2 to node 1 has the highest false connection score (see Table |II]i. 

When using SG on the parallel network, none of the true connections to node 1 are identified. While these connections might 
be recovered by allowing a greater range of A in the cross validation selection procedure, their absence reveals a downside to 
penalizing self-connections. As A is decreased below A*, the first connection identified is the self-connection. When SCSG is 
used, self-connections are always present, so decreasing A below A* activates a connection from a different node. In a sense, 
the SCSG approach has a "head start" in detecting connections. 

Simulations with n = 50 and n = 100 time samples (not shown) reveal that the ability of SCSG to recover the true 
connections improves as the number of samples increases. However, the number of trials in which false connections were 
made between nodes 1 and 2 (both directions) also increases as the number of samples increases. This behavior is consistent 
with the asymptotic result of Cor |2] which indicates that the probability of identifying the wrong network goes to one as the 
number of samples increases. 

VII. Macaque Brain Simulation 

Lasso-type procedures have recently been applied to MAR model estimation of brain activity lfT3l . ||3T1 . Ii36l . 1371 . In 
this section we simulate electrocorticogram (ECoG) recordings with a SMART model using a realistic network topology 
obtained from tract-tracing studies of a macaque brain ll38l . 1391 . A matrix representing connectivity in the macaque brain - 
the "macaque71" data set, consisting of 71 nodes and 746 connections - is shown in Fig. |6(a)| Each node is an area of the 
cortex. A connection between areas exists if neuronal axons physically connect respective areas. Figure [6(a)] suggests a sparse 
connectivity structure in the macaque. Including self-connections, there are an average of 11.5 out of 71 possible parents for 
each node. 

We simulate two networks based on this physical connectivity structure. First we assume that every physical connection 
in the macaque71 data set is actively conveying information. It is unrealistic to model every physical connection as active at 
a given time, so we also simulate a model in which up to ten randomly selected parents (including the self-connection) are 
active for each node. For simulation purposes, we choose a model order of six and draw coefficients for nonzero entries of the 
Ai matrices independently from a A/^(0, 0.041) distribution for the full model and a A/^(0, 0.161) model for the subset model. 
The first reahzation for each model that results in a stable network is used. 
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(a) Full Network (b) Subset Network 

Fig. 6. Connectivity matrices of the simulated macaque brain networks: (a) all physical connections are active and (b) up to ten parent nodes are active. A 
connection from node i to node j exists if the entry in the i^^ row and j'*'' column is black. 

Given these stable SMART models based on physical connections in the macaque brain, we generate time series using Eq. [T] 
with initial conditions Xi{0) = and driving noise Ui{t) distributed i.i.d. M{0, 1) over all channels and all time samples. The 
data are normalized, as described in Sec. lIV-Dl using the estimated power at each node. 

Normalization reduces the worst case SCSG false connection score of the full network from 1.73 to 1.25. Hence the SCSG 
estimate will be inconsistent as the number of samples increases. Note however, that SCSG can still consistently recover the 
parents of nodes i for which ipf^i < 1 for all j G S^. In this example, only four nodes i have 't/jf^i > 1, meaning that the 
parents of 67 of the nodes can be recovered accurately. Interestingly the neighborhoods of the four nodes which violate the 
false connection score condition exhibit a topology very similar to the parallel network described in Sec. |V] Each of these four 
nodes has many parent nodes which provide an indirect link to the same "grandparent" node. If only some of these paths are 
active at a given time, the network may be recoverable. This is indeed the case in the subset model where the false connection 
score is reduced from 4.07 to 0.54 by normalization. 

We illustrate the performance of several network estimation techniques in Fig. |7] using receiver operating characteristic 
(ROC) curves. We simulate the SCSG approach, the standard Lasso which promotes sparse coefficients as opposed to sparse 
connections (see Sec. |Vll, least squares estimation (Yule- Walker equations for n > pN), ridge regression, and an approach for 
estimating sparse non-causal networks described in [ITsl which we call the Meinshausen and Biihlmann (M&B) approach. The 
poor performance of the M&B approach illustrates that it is not appropriate for causal network inferenceH The performance 
of the SG technique is similar to that of the SCSG for these networks, so we do not include it here. 

Using ROC curves to evaluate performance removes the difficult task of selecting regularization parameters (which relate, 
sometimes directly, to significance level) for different techniques. The ROC curve is obtained for the SCSG, Lasso, and M&B 
approaches by varying the penalty weight A (using the same solver with group size of one when necessary). A detection 
occurs when a nonzero estimate a.; coincides with a true connection from node j to node i, while a miss occurs when 
kij = despite a true connection from j to i. False postives and true negatives are similarly defined. For least squares 
and ridge regression approaches, we use the simultaneous inference method proposed in ifTJl which makes use of adjusted 
p- values ||40l; however, we threshold the normalized test statistics directly (rather than the p-values) to produce ROC curves 
in order to avoid compuationally intensive Monte Carlo sampling of multivariate integrals. This yields the same curve due 
to the monotonic ralationship between test statistic and associated p-value. Since SCSG has additional knowledge that all 
self-connections are non-zero, we do not include self-connections when calculating ROC curves for any method. The ROC is 
defined as the percentage of true connections detected versus the percentage of false positive connections. 

We simulate both n = 300 and n = 900 time samples from all 71 nodes. In the first case we have fewer samples 
(300 X 71 = 21300) than coefficients (6 x 71^ = 30246), so enforcing a sparse solution is essential. This is clearly seen in 
Figs. |7(a)| and |7(c)| where SCSG and Lasso clearly outperform the other methods. In fact, least squares, ridge regression, and 
the Meinshausen and Biihlmann approach perform similarly to coin flipping. The SCSG performs better than the Lasso because 
the group assumption of the gLasso better matches the true model. In the second case with n = 900 time samples for each 
node, we have a few more than two samples for every coefficient. The results are shown in Figs. |7(b)| and |7(d)| Both SCSG 

^Readers familiar with 1181 will observe that the M&B technique is not meant to recover nonzero connections as defined here, but rather nonzero entiies in 
the inverse covariance matrix. We point out that although the MAR networks presented here are sparse in the number of parent nodes, the inverse covariance 
matrices are not sparse. 
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(c) Subset Network, T = 300 (d) Subset Network, T = 900 

Fig. 7. Fraction of connections identified vs. fraction of zero valued aij misidentified as nonzero (ROC curve) in simulated macaque brain networks. Top 
row: all physical connections active. Bottom row: subset of connections active. 



and Lasso perform better with more samples, as expected. The other methods still perform similarly to coin flipping. In the 
case of least squares and ridge regression, there are still too few samples to reUably estimate the covariance matrices. 

VIII. Conclusion 

We have analyzed application of the Group Lasso to the SMART model and proposed a modified gLasso for SMART model 
estimation. The gLasso groups together all p coefficients which comprise a connection from one node to another and penalizes 
the sum of the £2 norm of these coefficient groups. Such an approach tends to yield estimated networks with only a few 
nonzero connections. Our proposed SCSG removes the penalty for self-connections so that a node's own past is always used 
to predict its next state. We have shown that both the SO and SCSG approaches are capable of recovering the true network 
structure under certain conditions, the most crucial of which we term the false connection score, ipmax- MAR networks are 
identifiable when t/jmax < 1, but not when t/jmax > 1. To our knowledge, this is the first attempt to quantify the characteristics 
of MAR networks that result in gLasso based recovery. 

The false connection score condition (and to some degree Assumption lU implies that the network under study must be not 
only sparse, but also have the property that each node in the network is independent enough from other nodes (then 'if ij will be 
small). Clearly, a network with only self-connections satisfies this condition, but these are not very interesting or realistic. On 
the other hand, small world networks BTI have the type of structure that seems likely to meet the false connection condition 
(again depending on the connection coefficients). In small world networks, each node is connected to most of its nearest 
neighbors, but also has a few long range connections (short path lengths). It has been shown that such networks efficiently 
transmit information to all nodes BTI . B2l and suggested that the brain may have a small-world network structure. In fact, 
the structural connectivity pattern of the macaque brain used for simulations in Sec. IVII represents a small-world network 1391 . 
1431 . Small-world networks have sparse structure, though each node may have a somewhat large number of local connections. 

The false connection score indicates whether a false positive connection is likely to occur False negatives or missed 
connections are also of concern. Our analysis shows that, for fixed parameter networks (m, p, and constant), the penalty 
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weight A can be set small enough that false negatives are improbable. The false connection score determines whether this 
small A will avoid false positives. Our experience suggests that misses are more likely to occur for weak connections. Our 
examples indicate that the SCSG approach is effective at recovering network structure and that the false connection score is 
a an informative indicator of recovery performance for even relatively small sample sizes ??. Finally, note that the result of 
Theorems [T] and 12] apply to any gLasso application which satisfy the assumptions. In a generic application the false connection 
score may be interpreted as a statistical property of the X matrix. 



Appendix A 
Proof of asymptotic consistency 

To prove Theorem [T] we consider applying gLasso dS) to a single node (without loss of generality, node 1), and use the 
union bound to achieve the desired result. We restate Assumption[T]in terms of positive constants ci - C4 to facilitate the proof: 
number of nodes N = 0{n'''^), maximum number of parent nodes m = 0(71^^), model order p = C'(n'^^), and regularization 
parameter A = Q{n^'^^^^) with C2 < C4 and C3 + C4 < 1. 



KKT Conditions 

The Karush-Kuhn-Tucker (KKT) conditions for a solution to O follow from the theory of subgradients. The subgradient 
of ||v||2 is any vector whose ^2-norm is less than one for v = 0, while it is simply the gradient ||:^ when v ^ 0. Thus the 
KKT conditions are given by: 



Xf(yi-Xai) = -1-^ Vzs.t.ai,,^0 (11) 
At? 

||Xf(yi -Xai)||2 < — Vzs.t.ai., = 0. (12) 

For convenience, we define Zi [2^,1 ••• ^T.n with Zi_i = ^Xf (yi — Xai). The vector zi restricted to the active set is 
denoted zg^. We assume without loss of generality that iSi = {1,2,..., m}. 



Limiting False Negatives 

We start with conditions assuring that all nonzero coefficients are estimated as nonzero. To do so, we follow the arguments 
used by ll28ll . We consider the "oracle" solution; e.g., we consider the solution to the group sparse penalized estimator if the 
active set were known: 



aJ(A) 



arg mm — 



arg mm — 

OLs, n 



yi - Xaill 

yi - [^si 

ai.ilb- 



i=l 







2 



(13) 



(14) 



We must ensure that all coefficient subvectors in Si are nonzero in the oracle estimate a^; . Since all subvectors a^; ^ of a| will 
be zero for large enough A, this means we must make sure that A is not too big. 
All nonzero blocks must satisfy (fTTT i. so we consider: 



—£51 = X^^(y-Xa*) 

= O^Si a^i + ui - X^j kg^ ) 

= (X^^X5,(a5, -aJJ+X|^Ui (15) 

from which we obtain: 

\n 

= ^5. - — (X|,X5j-iz5, + (X^^X^J-^X^^ui (16) 

where the invertibility of X^^ X^^ is assured for large n since ?? grows faster than mp by Assumption [T] At this point the 
following notation is convenient. Let G^^ = n(X^^X5j)^^, with columns partitioned as G^^ ~ [^si 1 ■■■ Gsi .„ ], where each 



14 



sub-matrix is mp x p. Since n ^X^^X^^ is an empirical covariance matrix (maximum likelihood estimate of R^i^^j), we 
denote the true inverse covariance matrix of signals from the active set by G^^ = R-^i^^j^ ■■■ '^■si,™ ]. 

To show that each subvector ^ 7^ for i £ Si in the limit, it suffices to show that ||G^^ ii^^Si ~ n-^5i^i)ll2 < C'con < 
ilb- Applying the triangle inequality, we instead show that ||G5^(^Z5j — iX^^Ui)||2 < Ccon with the following lemma. 

Lemma 1: Given Assumptions [T}{5] ||G^^(-|z5j — iX^^Ui)||2 = ©(max (n^r^ , n'^^+^~2* with probability 

exceeding 1 — cxp (— 8(?i)). 



Proof: \},ing ||G^^ (fz^, - lX^^Ui)|| 



G^^X^^Ui||2, we bound the two terms separately. First: 




< 



0(„(c2-C4)/2) ^ (^^^c.-c4/2+C3/2-l/2) 



where W ^ M{Q,Y). The second inequaUty is simply the triangle inequality applied to z^^ since ||zi.i||2 = 1 for all i £ Si 
and each node has no more than m parents. The second to last inequality holds with probability greater than 1 — cxp (— 8(n)) 
). Given the conditions on constants C2-C4, the last line goes to zero. 
Next, consider: 



^||G^,X^^Ui||2 = ||(Xi^X5j-iX^^ui||2 



(x+ruiii2 



< -!- 

n 



1/2 

G, 



IU1II2 (17) 



where X^^ denotes the pseudoinverse and Ui ^ A/^(0, cr^I) since we have assumed independent time samples. Inequality (fTTI ) 
can be easily seen by considering the singular value decomposition of X^^. Obozinski et al. [|28i provide the following bound 
for the inverse sample covariance matrix: 



IG5JI2 < 2C-i„j >l-2exp(-e(n)) 
and 1441 provide a bound for the chi-square variate: 

P flluilla -n> 2Vnt + 2t) < cxp{-t) 



which holds for any t > 0. In particular, ||ui||2 < 5n for t = n with probability exceeding 1 — exp(— 71). Combining these 
bounds with (fTTI ) and Assumption [2] gives us: 

i||GLx|.u,ii.<2^v^-'^(;^ 

with probability greater than 1 — 2 cxp (— 0(n)). 

■ 

Since both terms of ||G^^(4zi — iX^^Ui)||2 go to zero as n grows, their sum will be less than Ccon with high probability 
for large n. This implies that each ||G^^ (f^i ~ ;^X^^Ui)||2, i £ Si will also be less than Ccoji, so for all i G Si, 

||at,J|2>|lai,,||2-Ceo„>0. 

We have shown that ■ 7^ for each i £ Si with probability greater than 1 — cxp (—8(71)). We next show that the oracle 
solution is in fact the overall solution with high probability. 
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Limiting False Positives 

Assuming that the oracle solution from (fTsT i has all nonzero subvectors we must ensure that a* = [(aj^)'^ o^]"^ is a 
solution to the full problem with high probability. In other words, we must show that ^||Xj(y — Xa*)||2 < 1 for all j £ 
To do so, we adopt a technique used in [18]. Write = X^ieSi + where 















= arg min E 













X, - ^ X,*, 

and Vj is a random variable representing the portion of Xj that can't be predicted by X;, i G Si. Now we have: 



(18) 



^||Xj(yi-XaD||, 



2 

2 

An 
2 

Xn 



(Y.X.,^,,,+y] (yi-Xan 
^ *j:,Xf (yi - Xal) + Vj(yi - Xa^) 



ieSi 



5:*J,^ + fvJ(yi-XaD 

^ ■' \\a* ,||2 A?i 



< 



ieSi 



lai,,ll2 



|ai.i!l2 l|ai,||2 



^||vj(..-x.;,| 



(19) 



(20) 



(21) 



where ( |20] | follows from the KKT condition ( fTTT ). The second term of ( 1211 1 is less than one by Assumption |6] We bound the 
remaining terms separately. In order to bound the first term, we use the following lemma: 
Lemma 2: — rhr < '^^^'u iT^^ 

|v|| ||w|| ||w| 

Proof: 



V 

Ml 



w 

M 



< 



< 



< 



||W 

1 1 



|V|| ||W|| 

|-||v||)| 



2||v-w|| 



w 



|w| 
V — wl 



We now bound the first term of (ISTT l: 



T 

hi 



|aL,||2 ||ai,^||2 



< 1I*.aI|2 I E 

< II*.aI|2(E 



ai, 



1/2 



l|al,J|2 ||ai,i||2 

oil-* l|2\l/2 

2||a^.,-ai,,|| \ 
llai.Ji i 
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where we have applied Lemma|2] From Assumption|3]we have ||ai.i||2 > Ccon- Using this and II 2 = llR-^j^^jH^l^^iXjllU < 

l|R-5iA^5i,5f II2 < CrnaxC'l^, we have: 




= C'(max(n ^2 ''^n^^"'' ^2'' ,n 2)) 



where the last inequality follows from ([16) and Lemma [T] 

Finally, we show that the last term of (ISTT i goes to zero faster than 0(ri'^^3+^^^^)/^). Since they are linear combinations of 
zero mean Gaussian random vectors, the p columns of Vj as well as vector y — Xa* are Gaussian. Though these p+1 vectors 
will be correlated for most interesting networks, the entries in any one of these vectors are i.i.d. Gaussian with variance less 
than Cpotuer- We establish the following lemma. 

Lemma 3: Let V be an n by p random matrix and w an n dimensional random vector For each i = 1, 2, . . . , n, let the 
i*^ row of V concatenated with the i*'^ entry of w be i.i.d. Gaussian vectors with distribution J\f{0, C), for some covariance 
matrix C whose maximum (diagonal) entry is Cm- Then with probability exceeding 1 — pexp{~n), ||V^w||2 < Cmy/5np. 

Proof: The entries in any column of V are i.i.d. Gaussian with variance less than Cm, as are the entries of w. With this 
in mind, we bound each entry of z = V^w by C,„ times a chi-squared random variable with n degrees of freedom (denoted 
Zi ^ Xn fo'" * — 1: 2, . . . ,p) and use the union bound: 



¥{\\z\\l>Cl5np) < P{\\i\\l>5np) 

< pF {Sf > 5n) 

< pcxp(— n) 

where we have used the same chi-squared bound as in Lemma [T] Thus with probability exceeding 1 — pexp(— n), |1 V^w|l2 < 
Cm. \/5np. 

■ 

Using Lemma|3] we have with probability exceeding 1 — pexp(— n), |lVj(y — Xa*)|l2 < Cpo-wer\/5np. Dividing by An 
and using Assumption [T] we have: 

^\\Vj{y - Xa*)||2 < '^P^^p/^ = 0(n(=^+^^-i)/2). (22) 
An A-^/n 

By ( fT2] i. there will be no false positives if ( I2TI 1 is less than one. With high probability, the second term is less than C/c < 1 
by Assumption |6] The first and third terms go to zero with large n with high probability. 



Union Bound 

We have shown that (|5]l recovers the correct parents of node 1 (set Si) with probability exceeding 1 — exp(— 8(?i)). To 
obtain the result for the whole network, we apply the union bound: 



N 



U 5. ^ 5, 



N 



i=l 

< 7Vexp(-e(n)) 

< n'=i exp(-e(n)) 

< cxp (ci Inn — 0(n)) 

< exp(-e(n)) 
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Appendix B 
Proof of Necessary Condition 

We must show that ^ will not recover the correct set of nonzero ai^i when Assumptions |2}j5]hold but J2ieSi "^J.i \ \ai'] 
1 + c. We do so by contradiction. 

Suppose A scales with n such that all the coefficient blocks in Si of the oracle solution are nonzero and the probability of 
false positives goes to zero as n grows. Then KKT condition ( IT2l) must hold with high probability for large n. This implies 
the following bound must hold with high probability for all j e Si ': 



> 



A 

- > 
2 - 

A 
2 



> 



> 



||Xj(yi-XaD||2 

ies 



l,tll2 



A ai 
-n-^ ||V[(yi - Xat 



ai,, 2 



,^Vj(y,-Xa- 



^ll*Jw|U 



A 



(1 



MlVnyi-Xanl 



(23) 



where w = [wi...w„]'2" and wi^^ = ^ 



From Eq. (ESji we have ||Vj (yi - Xa^;)!!^ = 0{^Jp/n), which 



llai.ill y - II - J ^-^ i'll2 

goes to zero since p/n < mp/n, which goes to zero as n goes to infinity by assumption. We have also shown that 
goes to zero; however, this term is now multiplied by ^ for some unknown A scaUng. To proceed, Eq. (|23] l implies: 

cA 



A 

y<2 



+ OWp/n) 



Since the second term goes to zero, this implies: 



C< |i*Jw||2 < l|*j||2l|w||2 < 



Cn 



m max w. 



i|l2 



where the last inequality follows from the definition of ^ j and the triangle inequality. This means there is at least one i ^ Si 



for which 



> 



cC„ 



ll"l ill-i II "-J- t'I II o \ '"'^m 

Now we use Assumption Qfand ( fTSl i: 

2a/ TflCrnax 



Combining this with Lemma (|2]) implies that ||ai.i — ai.ij|2 > 



cC^.^I|ai..|| 
2VmCm„^ 



< 
< 



cC„ 



ai,i||2 



2\TriCyyiaX 

ai,j - ai_ij|2 



< 



< 



AA/m 




where the last inequality follows from the proof of Lemma [T] Since mp/n goes to zero, we have the following lower bound 
on A: 



(24) 



•mC 

Since ai ^ ^ for at least one i by assumption, KKT condition ( fTTI) . repeated here for readability, must hold for at least 
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Xf (yi - Xai) = -^^^ Vis.t.ai,, ^0 (25) 

Using Lemma [3] (with V = X; and w = yi — Xai), the norm of the left hand side of (IZST i is less than Cpowery/^np with 
high probability for large n. On the other hand, (l24l i implies that the norm of the right hand side of dZST l is Vt{n/m). Given 
that n grows faster than m?p, this is a contradiction. 

The scaling law n > rri^p for large n (equivalently 2c2 + C3 < 1) was not required to prove asymptotic consistency. Other 
proof techniques may result in matching scaling laws. 

Appendix C 
Proof of Corollary[T] 

The proof is the same as that of Theorem [T| with a few minor changes. The KKT condition ( fTTI) for I = 1 becomes 
^^(yi ^ Xai) = 0, which implies Zi i = 0. The results of Lemma [T] still apply with m replaced by m — 1 in the proof. In 

App.S V.f_^i = *L A^IL '™P^y '^P^^'^^'^ "^"^ ^^^^ = T.^esu^^l , Since Xf (yi -Xai) 

instead of 



2 



2IIS1 
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